ABSTRACT

From social networks to P2P systems, network sampling 
arises in many settings. We present a detailed study on 
the nature of biases in network sampling strategies to shed 
light on how best to sample from networks. We investigate 
connections between specific biases and various measures of 
structural representativeness. We show that certain biases 
are, in fact, beneficial for many applications, as they "push" 
the sampling process towards inclusion of desired properties. 
Finally, we describe how these sampling biases can be exploited
in several real-world applications, including disease
outbreak detection and market research.

Categories and Subject Descriptors 

H.2.8 [Database Management]: Database Applications -
Data Mining

General Terms 

Algorithms, Experimentation, Measurement

Keywords 

sampling, bias, social network analysis, complex networks, 

1. INTRODUCTION AND MOTIVATION

We present a detailed study on the nature of biases in net- 
work sampling strategies to shed light on how best to sample 
from networks. A network is a system of interconnected en- 
tities typically represented mathematically as a graph: a set 
of vertices and a set of edges among the vertices. Networks 
are ubiquitous and arise across numerous and diverse do- 
mains. For instance, many Web-based social media, such 
as online social networks, produce large amounts of data 
on interactions and associations among individuals. Mobile 
phones and location-aware devices produce copious amounts 
of data on both communication patterns and physical proximity
between people. In biology as well, from neurons to proteins
to food webs, there is now access to large networks of
associations among various entities and a need to analyze and
understand these data.

With advances in technology, pervasive use of the Internet, 
and the proliferation of mobile phones and location-aware 
devices, networks under study today are not only substan- 
tially larger than those in the past, but sometimes exist in 
a decentralized form (e.g. the network of blogs or the Web 
itself). For many networks, their global structure is not 
fully visible to the public and can only be accessed through 
"crawls" (e.g. online social networks). These factors can 
make it prohibitive to analyze or even access these networks 
in their entirety. How, then, should one proceed in analyzing 
and mining these network data? One approach to address- 
ing these issues is sampling: inference using small subsets of 
nodes and links from a network. 

From epidemiological applications [13] to Web crawling [7]
and P2P search [47], network sampling arises across many 
different settings. In the present work, we focus on a partic- 
ular line of investigation that is concerned with constructing 
samples that match critical structural properties of the orig- 
inal network. Such samples have numerous applications in 
data mining and information retrieval. In [29], for example, 
structurally-representative samples were shown to be effec- 
tive in inferring network protocol performance in the larger 
network and significantly improving the efficiency of protocol
simulations. In Section 7, we discuss several additional
applications. Although there have been a number of recent
strides in work on network sampling (e.g. [5, 25, 29, 34]),
there is still very much that requires better and deeper un- 
derstanding. Moreover, many networks under analysis, al- 
though treated as complete, are, in fact, samples due to lim- 
itations in data collection processes. Thus, a more refined 
understanding of network sampling is of general importance 
to network science. Towards this end, we conduct a detailed 
study on network sampling biases. There has been a recent 
spate of work focusing on problems that arise from network 
sampling biases including how and why biases should be 
avoided [H[ISl[2SlI3niIlSIIlZ]. Our work differs from much of
this existing literature in that, for the first time in a compre- 
hensive manner, we examine network sampling bias as an as- 
set to be exploited. We argue that biases of certain sampling 
strategies can be advantageous if they "push" the sampling 
process towards inclusion of specific properties of interest.^1
Our main aim in the present work is to identify and understand
the connections between specific sampling biases and
specific definitions of structural representativeness, so that
these biases can be leveraged in practical applications.

^1 This is similar to the role of bias in stratified sampling in
classical statistics.

Summary of Findings. We conduct a detailed investiga- 
tion of network sampling biases. We find that bias towards 
high expansion (a concept from expander graphs) offers sev- 
eral unique advantages over other biases such as those to- 
ward high degree nodes. We show both empirically and an- 
alytically that such an expansion bias "pushes" the sampling 
process towards new, undiscovered clusters and the discov- 
ery of wider portions of the network. In other analyses, 
we show that a simple sampling process that selects nodes 
with many connections from those already sampled is often 
a reasonably good approximation to directly sampling high 
degree nodes and locates well-connected (i.e. high degree) 
nodes significantly faster than most other methods. We also 
find that the breadth-first search, a widely-used sampling 
and search strategy, is surprisingly among the most dismal 
performers in terms of both discovering the network and 
accumulating critical, well-connected nodes. Finally, we de- 
scribe ways in which some of our findings can be exploited 
in several important applications including disease outbreak 
detection and market research. A number of these afore- 
mentioned findings are surprising in that they are in stark 
contrast to conventional wisdom followed in much of the existing
literature (e.g. [2, 3, 30, 41, 42]).

2. RELATED WORK 

Not surprisingly, network sampling arises across many di- 
verse areas. Here, we briefly describe some of these different 
lines of research. 

Network Sampling in Classical Statistics. The concept 
of sampling networks first arose to address scenarios where 
one needed to study hidden or difficult-to-access populations 
(e.g. illegal drug users, prostitutes). For recent surveys, one 
might refer to [19, 28]. The work in this area focuses almost
exclusively on acquiring unbiased estimates related to vari- 
ables of interest attached to each network node. The present 
work, however, focuses on inferring properties related to the 
network itself (many of which are not amenable to being 
fully captured by simple attribute frequencies). Our work, 
then, is much more closely related to representative subgraph 
sampling. 

Representative Subgraph Sampling. In recent years, a 
number of works have focused on representative subgraph 
sampling: constructing samples in such a way that they 
are condensed representations of the original network (e.g. 
[25, 29, 32, 34]). Much of this work focuses on how best to
produce a "universal" sample representative of all structural 
properties in the original network. By contrast, we subscribe 
to the view that no single sampling strategy may be
appropriate for all applications. Our aim, then, is to better
understand the biases in specific sampling strategies to shed 
light on how best to leverage them in practical applications. 

Unbiased Sampling. There has been a relatively recent
spate of work (e.g. [20, 22, 47]) that focuses on constructing
uniform random samples in scenarios where nodes cannot 
be easily drawn randomly (e.g. settings such as the Web 
where nodes can only be accessed through crawls). These 
strategies, often based on modified random walks, have been 
shown to be effective for various frequency estimation problems
(e.g. inferring the proportion of pages of a certain language
in a Web graph [22]). However, as mentioned above, the present
work focuses on using samples to infer structural (and
functional) properties of the network itself. In this regard,
we found these unbiased methods to be less effective
during preliminary testing. Thus, we do not consider them 
and instead focus our attention on other more appropriate 
sampling strategies (such as those mentioned in representa- 
tive subgraph sampling). 

Studies on Sampling Bias. Several studies have inves- 
tigated biases that arise from various sampling strategies 
(e.g. [1, 15, 30, 31, 46]). For instance, [46] showed that,
under the simple sampling strategy of picking nodes at random
from a scale-free network (i.e. a network whose degree
distribution follows a power law), the resultant subgraph
sample will not be scale-free. The authors of [1, 31] showed
the converse is true under traceroute sampling. Virtually all 
existing results on network sampling bias focus on its neg- 
ative aspects. By contrast, we focus on the advantages of 
certain biases and ways in which they can be exploited in 
network analysis. 

Property Testing. Work on sampling exists in the fields 
of combinatorics and graph theory and is centered on the 
notion of property testing in graphs [38]. Properties such as
those typically studied in graph theory, however, may be less
useful for the analysis of real-world networks (e.g. the exact
meaning of, say, k-colorability [38] within the context of a
social network is unclear). Nevertheless, theoretical work on 
property testing in graphs is excellently surveyed in [38].

Other Areas. Decentralized search (e.g. searching unstructured
P2P networks) and Web crawling can both be
framed as network sampling problems, as both involve mak- 
ing decisions from subsets of nodes and links from a larger 
network. Indeed, network sampling itself can be viewed as a 
problem of information retrieval, as the aim is to seek out a 
subset of nodes that either individually or collectively match 
some criteria of interest. Several of the sampling strategies 
we study in the present work, in fact, are graph search al- 
gorithms (e.g. breadth-first search). Thus, a number of our 
findings discussed later have implications for these research 
areas (e.g. see [39]). For reviews on decentralized search
both in the contexts of complex networks and P2P systems,
one may refer to [27] and [49], respectively. For examples of
connections between Web crawling and network sampling,
see [7, 42].

3. PRELIMINARIES 

3.1 Notations and Definitions 

We now briefly describe some notations and definitions 
used throughout this paper. 

Definition 1. $G = (V, E)$ is a network or graph where $V$
is a set of vertices and $E \subseteq V \times V$ is a set of edges.

Definition 2. A sample $S$ is a subset of vertices, $S \subset V$.

Definition 3. $N(S)$ is the neighborhood of $S$ if
$N(S) = \{w \in V - S : \exists v \in S \text{ s.t. } (v, w) \in E\}$.

Definition 4. $G_S$ is the induced subgraph of $G$ based on
the sample $S$ if $G_S = (S, E_S)$ where the vertex set is
$S \subset V$ and the edge set is $E_S = (S \times S) \cap E$.
The induced subgraph of a sample may also be referred to as a
subgraph sample.
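
Definitions 3 and 4 can be made concrete with a few lines of Python. This is only an illustrative sketch: the adjacency-dictionary representation, the function names, and the toy graph are assumptions made for the example and are not taken from the paper.

```python
def neighborhood(adj, sample):
    """Return N(S): nodes outside S adjacent to at least one node in S."""
    sample = set(sample)
    reached = set()
    for v in sample:
        reached |= adj[v]
    return reached - sample


def induced_edges(adj, sample):
    """Return E_S: the undirected edges of G with both endpoints in S."""
    sample = set(sample)
    return {frozenset((u, w)) for u in sample for w in adj[u] if w in sample}


# Hypothetical toy graph: triangle b-c-d with a pendant node a attached to b.
adj = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c"},
}

frontier = neighborhood(adj, {"a", "b"})     # nodes one hop outside {a, b}
edges = induced_edges(adj, {"b", "c", "d"})  # edges of the induced triangle
```

Storing each undirected edge as a frozenset avoids counting (u, w) and (w, u) separately, which matches the undirected, unweighted treatment of all networks in this study.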



3.2 Datasets 

We study sampling biases in a total of twelve different
networks: a power grid (PowerGrid [50]), a Wikipedia voting
network (WikiVote p]), a PGP trust network (PGP 0),
a citation network (HEPTh [33]), an email network (Enron [33]),
two co-authorship networks (CondMat [33] and
AstroPh [33]), two P2P file-sharing networks (Gnutella04
[33] and Gnutella31 [33]), two online social networks
(Epinions [33] and Slashdot [33]), and a product co-purchasing
network (Amazon [33]). These datasets were chosen to
represent a rich set of diverse networks from different domains.
This diversity allows a more comprehensive study of network
sampling and thorough assessment of the performance
of various sampling strategies in the face of varying network
topologies. Table 1 shows characteristics of each dataset.
All networks are treated as undirected and unweighted. 



Network      N         D         PL    CC    AD
PowerGrid    4941      0.0005    19    0.11  2.7
WikiVote     7066      0.004     3.3   0.21  28.5
PGP          10,680    0.0004    7.5   0.44  4.6
Gnutella04   10,876    0.0006    4.6   0.01  7.4
AstroPh      17,903    0.0012    4.2   0.67  22.0
CondMat      21,363    0.0004    5.4   0.70  8.5
HEPTh        27,400    0.0009    4.3   0.34  25.7
Enron        33,696    0.0003    4.0   0.71  10.7
Gnutella31   62,561    0.00008   5.9   0.01  4.7
Epinions     75,877    0.0001    4.3   0.26  10.7
Slashdot     82,168    0.0001    4.1   0.10  12.2
Amazon       262,111   0.00003   8.8   0.43  6.9

Table 1: Network Properties. Key: N = # of nodes, D =
density, PL = characteristic path length, CC = local
clustering coefficient, AD = average degree.



4. NETWORK SAMPLING 

In the present work, we focus on a particular class of sam- 
pling strategies, which we refer to as link-trace sampling. In 
link-trace sampling, the next node selected for inclusion into 
the sample is always chosen from among the set of nodes 
directly connected to those already sampled. In this way, 
sampling proceeds by tracing or following links in the net- 
work. This concept can be defined formally. 

Definition 5. Given an integer k and an initial node (or
seed) $v \in V$ to which S is initialized (i.e. $S = \{v\}$), a
link-trace sampling algorithm, A, is a process by which nodes are
iteratively selected from among the current neighborhood
N(S) and added to S until $|S| = k$.
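
The loop described in Definition 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `choose` callback interface, and the toy graph are hypothetical, and the uniform chooser is only a placeholder policy (it is not one of the seven strategies studied here). The sketch also stops early if N(S) becomes empty, a case Definition 5 does not address.

```python
import random


def link_trace_sample(adj, seed, k, choose):
    """Grow S from {seed}, adding one node of N(S) per step until |S| = k."""
    sample = [seed]                        # nodes in the order they were added
    in_sample = {seed}
    frontier = set(adj[seed]) - in_sample  # current neighborhood N(S)
    while len(sample) < k and frontier:
        v = choose(adj, in_sample, frontier)
        sample.append(v)
        in_sample.add(v)
        frontier.discard(v)
        frontier |= adj[v] - in_sample     # v's unsampled neighbors join N(S)
    return sample


def make_uniform_chooser(rng_seed=0):
    """Placeholder policy (assumption): pick uniformly at random from N(S)."""
    rng = random.Random(rng_seed)

    def choose(adj, in_sample, frontier):
        return rng.choice(sorted(frontier))

    return choose


# Hypothetical toy graph: triangle b-c-d with a pendant node a attached to b.
adj = {
    "a": {"b"},
    "b": {"a", "c", "d"},
    "c": {"b", "d"},
    "d": {"b", "c"},
}

sample = link_trace_sample(adj, "a", 3, make_uniform_chooser())
```

Keeping the selection rule behind a callback mirrors the structure of the text: the definition fixes where candidates come from, N(S), and each strategy only differs in which candidate it prefers.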

Link-trace sampling may also be referred to as crawling 
(since links are "crawled" to access nodes) or viewed as on- 
line sampling (since the network G reveals itself iteratively 
during the course of the sampling process). The key advan- 
tage of sampling through link-tracing, then, is that complete 
access to the network in its entirety is not required. This 
is beneficial for scenarios where the network is either large 
(e.g. an online social network), decentralized (e.g. an un- 
structured P2P network), or both (e.g. the Web). 

As an aside, notice from Definition 5 that we have implicitly
assumed that the neighbors of a given node can be obtained by
visiting that node during the sampling process (i.e. N(S) is
known). This, of course, accurately characterizes most real
scenarios. For instance, neighbors of a Web page can be gleaned
from the hyperlinks on a visited page and neighbors of an
individual in an online social network can be acquired by
viewing (or "scraping") the friends list.

Having provided a general definition of link-trace sam- 
pling, we must now address which nodes in N(S) should be 
preferentially selected at each iteration of the sampling pro- 
cess. This choice will obviously directly affect the properties 
of the sample being constructed. We study seven different
approaches, all of which are quite simple yet, at the same
time, poorly understood in the context of real-world networks.

Breadth-First Search (BFS). Starting with a single seed
node, the BFS explores the neighbors of visited nodes. At 
each iteration, it traverses an unvisited neighbor of the ear- 
liest visited node [14]. In both [30] and [42], it was em- 
pirically shown that BFS is biased towards high-degree and 
high-PageRank nodes. BFS is used prevalently to crawl and 
collect networks (e.g. [41]).

Depth-First Search (DFS). DFS is similar to BFS, ex- 
cept that, at each iteration, it visits an unvisited neighbor 
of the most recently visited node Q3].

Random Walk (RW). A random walk simply selects the 
next hop uniformly at random from among the neighbors of 
the current node [37].

Forest Fire Sampling (FFS). FFS, proposed in [34], is
essentially a probabilistic version of BFS. At each iteration
of a BFS-like process, a neighbor v is only explored according
to some "burning" probability p. At p = 1, FFS is identical
to BFS. We use p = 0.7, as recommended in [34].

Degree Sampling (DS). The DS strategy involves greedily
selecting the node $v \in N(S)$ with the highest degree (i.e.
number of neighbors). A variation of DS was analytically
and empirically studied as a P2P search algorithm in [2].
Notice that, in order to select the node $v \in N(S)$ with the
highest degree, the process must know $|N(\{v\})|$ for each
$v \in N(S)$. That is, knowledge of N(N(S)) is required at
each iteration. As noted in [2], this requirement is acceptable
for some domains such as P2P networks and certain social 
networks. The DS method is also feasible in scenarios where 
1) one is interested in efficiently "downsampling" a network 
to a connected subgraph, 2) a crawl is repeated and history 
of the last crawl is available, or 3) the proportion of the 
network accessed to construct a sample is less important. 

SEC (Sample Edge Count). Given the currently constructed
sample S, how can we select a node $v \in N(S)$ with the
highest degree without having knowledge of N(N(S))?
The SEC strategy tracks the links from the currently
constructed sample S to each node $v \in N(S)$ and selects the
node v with the most links from S. In other words, we
use the degree of v in the induced subgraph of $S \cup \{v\}$ as
an approximation of the degree of v in the original network
G. Similar approaches have been employed as part of Web
crawling strategies with some success (e.g. [9]).

XS (Expansion Sampling). The XS strategy is based on
the concept of expansion from work on expander graphs and
seeks to greedily construct the sample with the maximal
expansion: $\arg\max_{S : |S| = k} \frac{|N(S)|}{|S|}$, where k is
the desired sample size [23, 40]. At each iteration, the next
node v selected for inclusion in the sample is chosen based
on the expression:

$$\arg\max_{v \in N(S)} |N(\{v\}) - (N(S) \cup S)|.$$

Like the DS strategy, this approach utilizes knowledge of 
N(N(S)). In Sections 5.3 and 6.2, we will investigate in
detail the effect of this expansion bias on various properties
of constructed samples.
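
To make the difference between the three explicitly biased strategies concrete, the sketch below implements the DS, SEC, and XS selection rules as described above and applies them to one small graph. The function names, the tie-breaking by sorted node label, and the toy graph are assumptions made for illustration. Note also that `sec_choice` reads the full adjacency set of each candidate for brevity; a real crawl would instead keep a running count of links seen from S, since SEC is meant to work without knowledge of N(N(S)).

```python
def ds_choice(adj, in_sample, frontier):
    """DS: the node in N(S) with the highest degree in G."""
    return max(sorted(frontier), key=lambda v: len(adj[v]))


def sec_choice(adj, in_sample, frontier):
    """SEC: the node in N(S) with the most links from the sample S."""
    return max(sorted(frontier), key=lambda v: len(adj[v] & in_sample))


def xs_choice(adj, in_sample, frontier):
    """XS: the node in N(S) maximizing |N({v}) - (N(S) U S)|."""
    known = in_sample | frontier
    return max(sorted(frontier), key=lambda v: len(adj[v] - known))


# Hypothetical toy graph in which the three rules disagree.
adj = {
    "s": {"t", "c", "h"},
    "t": {"s", "u", "c", "h"},
    "u": {"t", "c", "p"},
    "c": {"s", "t", "u", "h"},        # three links from S, nothing new
    "h": {"s", "t", "c", "p", "n1"},  # highest degree, one new node
    "p": {"u", "h", "n2", "n3"},      # two new nodes
    "n1": {"h"},
    "n2": {"p"},
    "n3": {"p"},
}
in_sample = {"s", "t", "u"}
frontier = {"c", "h", "p"}            # N(S) for the sample above

picks = (
    ds_choice(adj, in_sample, frontier),
    sec_choice(adj, in_sample, frontier),
    xs_choice(adj, in_sample, frontier),
)
```

In this example DS prefers the hub `h`, SEC prefers `c` (the node most tightly tied to the existing sample), and XS prefers `p`, whose neighbors are mostly undiscovered.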



5. EVALUATING REPRESENTATIVENESS

What makes one sampling strategy "better" than another?
In computer science, "better" is typically taken to be
structural representativeness (e.g. see [25, 29, 35]). That is,
samples are considered better if they are more representative of
structural properties in the original network. There are,
of course, numerous structural properties from which to
choose, and, as correctly observed by Ahmed et al. [4], it
is not always clear which should be chosen. Rather than
choosing arbitrary structural properties as measures of
representativeness, we select specific measures of
representativeness that we view as being potentially useful for
real applications. We divide these measures (described below)
into three categories: Degree, Clustering, and Reach. For each
sampling strategy, we generate 100 samples using randomly
selected seeds, compute our measures of representativeness
on each sample, and plot the average value as sample size
grows. (Standard deviations of computed measures are
discussed in Section 5.4. Applications for these measures of
representativeness are discussed later in Section 7.) Due to
space limitations and the large number of networks evaluated,
for each evaluation measure, we only show results for
two datasets that are illustrative of general trends observed
in all datasets. However, full results are available as
supplementary material.^2

5.1 Degree

The degree (number of neighbors) of nodes in a network
is a fundamental and well-studied property. In fact, other
graph-theoretic properties such as the average path length
between nodes can, in some cases, be viewed as byproducts
of degree (e.g. short paths arising from a small number of
highly-connected hubs that act as conduits [5]). We study
two different aspects of degree (with an eye towards
real-world applications, discussed in Section 7).

5.1.1 Measures

Degree Distribution Similarity (DistSim). We take the
degree sequence of the sample and compare it to that of the
original network using the two-sample Kolmogorov-Smirnov
(K-S) D-statistic [34], a distance measure. Our objective
here is to measure the agreement between the two degree
distributions in terms of both shape and location. Specifically,
the D-statistic is defined as $D = \max_x \{|F(x) - F_S(x)|\}$,
where x ranges over node degrees, and $F$ and $F_S$ are
the cumulative degree distributions for $G$ and $G_S$,
respectively [34]. We compute the distribution similarity by
subtracting the K-S distance from one.

Hub Inclusion (Hubs). In several applications, one cares
less about matching the overall degree distribution and more
about accumulating the highest degree nodes into the sample
quickly (e.g. immunization strategies [3]). For these
scenarios, sampling is used as a tool for information retrieval.
Here, we evaluate the extent to which sampling strategies
accumulate hubs (i.e. high degree nodes) quickly into the
sample. As sample size grows, we track the proportion of
the top K nodes accumulated by the sample. For our tests,
we use K = 100.
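
The two degree measures can be sketched in a few lines. This is an illustrative sketch only: the function names and the toy graph are assumptions, ties among equally ranked hubs are broken by node label (a choice the text does not specify), and the sample's degree sequence is taken from the induced subgraph, following the definition of DistSim above.

```python
def induced_degrees(adj, sample):
    """Degree of each sampled node within the induced subgraph G_S."""
    sample = set(sample)
    return [len(adj[v] & sample) for v in sample]


def dist_sim(degrees_g, degrees_s):
    """DistSim = 1 - D, with D the two-sample K-S distance between degree lists."""
    def cdf(values, x):
        return sum(1 for d in values if d <= x) / len(values)

    points = set(degrees_g) | set(degrees_s)
    d_stat = max(abs(cdf(degrees_g, x) - cdf(degrees_s, x)) for x in points)
    return 1.0 - d_stat


def hub_inclusion(adj, sample, top_k):
    """Proportion of the top_k highest-degree nodes of G present in the sample."""
    hubs = sorted(adj, key=lambda v: (-len(adj[v]), v))[:top_k]
    return len(set(hubs) & set(sample)) / top_k


# Hypothetical toy graph: hub h linked to a, b, c, plus an edge a-b.
adj = {
    "h": {"a", "b", "c"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
}
degrees_g = [len(adj[v]) for v in adj]
similarity = dist_sim(degrees_g, induced_degrees(adj, {"h", "a", "b"}))
hubs_found = hub_inclusion(adj, {"h", "c"}, top_k=1)
```

Because both cumulative distributions are step functions, evaluating the difference only at observed degree values is enough to find the maximum.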



5.1.2 Results 

Figure 1 shows the degree distribution similarity (DistSim)
and hub inclusion (Hubs) for the Slashdot and Enron
datasets. Note that the SEC and DS strategies, both of 
which are biased to high degree nodes, perform best on hub 
inclusion (as expected), but are the worst performers on the 
DistSim measure (which is also a direct result of this bias).
(The XS strategy exhibits a similar trend but to a slightly 
lesser extent.) On the other hand, strategies such as BFS, 
FFS, and RW tend to perform better on DistSim, but worse 
on Hubs. For instance, the DS and SEC strategies locate 
the majority of the top 100 hubs with sample sizes less than 
1% in some cases. BFS and FFS require sample sizes of over 
10% (and the performance differential is larger when locat- 
ing hubs ranked higher than 100). More importantly, no 
strategy performs best on both measures. This, then, sug- 
gests a tension between goals: constructing small samples of 
the most well-connected nodes is in conflict with producing 
small samples exhibiting representative degree distributions. 
More generally, when selecting sample elements, choices re- 
sulting in gains for one area can result in losses for another. 
Thus, these choices must be made in light of how samples 
will be used, a subject we discuss in greater depth in
Section 7. We conclude this section by briefly noting that the
trend observed for SEC seems to be somewhat dependent 
upon the quality and number of hubs actually present in a 
network (relative to the size of the network, of course). That 
is, SEC matches DS more closely as degree distributions
exhibit longer and denser tails (as shown in Figure 2). We will
revisit this in Section 6.3. (Other strategies are sometimes
affected similarly, but the trend is much less consistent.) In 
general, we find SEC best matches DS performance on many 
of the social networks (as opposed to technological networks 
such as the PowerGrid with few "good" hubs, lower average 
degree, and longer path lengths). However, further investi- 
gation is required to draw firm conclusions on this last point. 

Figure 1: Evaluating DistSim and Hubs. Results for
remaining networks are similar.



^2 Supplementary material for this paper is available at:
http://arun.maiya.net/papers/supp-netbias.pdf



5.2 Clustering 

Many real-world networks, such as social networks, exhibit
a much higher level of clustering than what one would expect
at random [50]. Thus, clustering has been another graph
property of interest for some time. Here, we are interested
in evaluating the extent to which samples exhibit the level
of clustering present in the original network. We employ two
notions of clustering, which we now describe.

Figure 2: Performance of SEC on Hubs (shown in green
on left) is observed to be dependent on the tail of the
degree distributions (DD). SEC matches DS more closely when
more and better quality hubs are present. For Hubs, SEC
generally performs best on the social networks evaluated.

5.2.1 Measures 

Local Clustering Coefficient (CCloc). The local clus- 
tering coefficient [43] of a node captures the extent to which 
the node's neighbors are also neighbors of each other.
Formally, the local clustering coefficient of a node is defined
as $C_L(v) = \frac{2\ell}{d_v (d_v - 1)}$, where $d_v$ is the
degree of node v and $\ell$ is the number of links among the
neighbors of v. The average local clustering coefficient for a
network is simply $\frac{\sum_{v \in V} C_L(v)}{|V|}$.

Global Clustering Coefficient (CCglb). The global 
clustering coefficient [43] is a function of the number of
triangles in a network. It is measured as the number of closed
triplets divided by the number of connected triples of nodes. 
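
Both clustering measures can be computed directly from adjacency sets, as the sketch below illustrates. The function names and toy graph are assumptions made for the example, and assigning a coefficient of zero to nodes with fewer than two neighbors is a common convention that the text does not state.

```python
from itertools import combinations


def local_clustering(adj, v):
    """C_L(v) = 2l / (d_v (d_v - 1)), with l the links among v's neighbors."""
    d = len(adj[v])
    if d < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    links = sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
    return 2.0 * links / (d * (d - 1))


def average_local_clustering(adj):
    """CCloc: mean of C_L(v) over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def global_clustering(adj):
    """CCglb: closed triplets divided by connected triples of nodes."""
    closed = 0
    connected = 0
    for v in adj:
        d = len(adj[v])
        connected += d * (d - 1) // 2  # triples centered at v
        closed += sum(1 for u, w in combinations(adj[v], 2) if w in adj[u])
    return closed / connected if connected else 0.0


# Hypothetical toy graph: triangle a-b-c with a pendant node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc_loc = average_local_clustering(adj)
cc_glb = global_clustering(adj)
```

On this toy graph the two measures differ (7/12 versus 3/5), which illustrates why the text treats them as separate notions of clustering.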

5.2.2 Results 

Results for clustering measures are less consistent than 
for other measures. Overall, DFS and RW strategies appear 
to fare relatively better than others. We do observe that, 
for many strategies and networks, estimates of clustering 
are initially higher-than-actual and then gradually decline 
(see Figure 3). This agrees with intuition. Nodes in clusters
should intuitively have more paths leading to them and
will, thus, be encountered earlier in a sampling process (as 
opposed to nodes not embedded in clusters and located in 
the periphery of a network). This, then, should be taken 
into consideration in applications where accurately match- 
ing clustering levels is important. 

Figure 3: Evaluating CCglb and CCloc.

5.3 Network Reach

We propose a new measure of representativeness called
network reach. As a newer measure, network reach has
obviously received considerably less attention than Degree and
Clustering within the existing literature, but it is,
nevertheless, a vital measure for a number of important
applications (as we will see in Section 7). Network reach
captures the extent to which a sample covers a network.
Intuitively, for a sample to be truly representative of a large
network, it should consist of nodes from diverse portions of the
network, as opposed to being relegated to a small "corner" of
the graph. This concept will be made more concrete by
discussing in detail the two measures of network reach we
employ: community reach and the discovery quotient.

5.3.1 Measures 

Community Reach (CNM and RAK). Many real-world
networks exhibit what is known as community structure. A 
community can be loosely defined as a set of nodes more 
densely connected among themselves than to other nodes in 
the network. Although there are many ways to represent 
community structure depending on various factors such as 
whether or not overlapping is allowed, in this work, we rep- 
resent community structure as a partition: a collection of 
disjoint subsets whose union is the vertex set V [18]. Under 
this representation, each subset in the partition represents 
a community. The task of a community detection algorithm 
is to identify a partition such that vertices within the same 
subset in the partition are more densely connected to each 
other than to vertices in other subsets [18]. For the criterion
of community reach, a sample is more representative of the 
network if it consists of nodes from more of the communi- 
ties in the network. We measure community reach by taking 
the number of communities represented in the sample and 
dividing by the total number of communities present in the 
original network. Since a community is essentially a cluster 
of nodes, one might wonder why we have included commu- 
nity reach as a measure of network reach, rather than as a 
measure of clustering. The reason is that we are slightly 
less interested in the structural details of communities de- 
tected here. Rather, our aim is to assess how "spread out" 
a sample is across the network. Since community detec- 
tion is somewhat of an inexact science (e.g. see [21]), we 
measure community reach with respect to two separate al- 
gorithms. We employ both the method proposed by Clauset 
et al. in [12] (denoted as CNM) and the approach proposed 
by Raghavan et al. in [45] (denoted as RAK). Essentially, 
for our purposes, we are defining communities simply as the 
output of a community detection algorithm. 

Discovery Quotient (DQ). An alternative view of network
reach is to measure the proportion of the network that
is discovered by a sampling strategy. The number of nodes
discovered by a strategy is defined as $|S \cup N(S)|$. The
discovery quotient is this value normalized by the total number
of nodes in a network: $\frac{|S \cup N(S)|}{|V|}$. Intuitively,
we are defining the reach of a sample here by measuring the
extent to which it is one hop away from the rest of the network.
As we will discuss in Section 7, samples with high discovery
quotients have several important applications. Note that a
simple greedy algorithm for coverage problems such as this has
a well-known sharp approximation bound of $1 - 1/e$ [16, 39].
However, link-trace sampling is restricted to selecting subse- 
quent sample elements from the current neighborhood N(S) 
at each iteration, which results in a much smaller search 
space. Thus, this approximation guarantee can be shown 
not to hold within the context of link-trace sampling. 
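
Both reach measures reduce to simple set computations, sketched below. The function names, the `membership` dictionary, and the toy graph are assumptions for illustration; in particular, `membership` stands in for the output of a community detection algorithm such as CNM or RAK run on the full network, which is not implemented here.

```python
def community_reach(sample, membership):
    """Fraction of communities with at least one node in the sample."""
    all_communities = set(membership.values())
    reached = {membership[v] for v in sample}
    return len(reached) / len(all_communities)


def discovery_quotient(adj, sample):
    """DQ = |S U N(S)| / |V|: the share of nodes sampled or one hop away."""
    discovered = set(sample)
    for v in sample:
        discovered |= adj[v]
    return len(discovered) / len(adj)


# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
# Assumed partition: one community per triangle.
membership = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

narrow = (community_reach({"a", "c"}, membership),
          discovery_quotient(adj, {"a", "c"}))
wide = (community_reach({"c", "d"}, membership),
        discovery_quotient(adj, {"c", "d"}))
```

Two samples of the same size can score very differently: the sample {a, c} stays inside one triangle, while {c, d} straddles the bridge and discovers the whole toy network.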

5.3.2 Results 

As shown in Figure 4, the XS strategy displays by far the
best performance on all three measures of network
reach. We highlight several observations here. First,
the extent to which the XS strategy outperforms all others 
on the RAK and CNM measures is quite striking. We posit 
that the expansion bias of the XS strategy "pushes" the sam- 
pling process towards the inclusion of new communities not 
already seen (see also [10]). In Section 6.2, we will
analytically examine this connection between expansion bias and
community reach. On the other hand, the SEC method ap- 
pears to be among the least effective in reaching different 
communities or clusters. We attribute this to the fact that 
SEC preferentially selects nodes with many connections to 
nodes already sampled. Such nodes are likely to be mem- 
bers of clusters already represented in the sample. Second, 
on the DQ measure, it is surprising that the DS strategy, 
which explicitly selects high degree nodes, often fails to even 
come close to the XS strategy. We partly attribute this to an 
overlap in the neighborhoods of well-connected nodes. By 
explicitly selecting nodes that contribute to expansion, the 
XS strategy is able to discover a much larger proportion of 
the network in the same number of steps - in some cases, 
by actively sampling comparatively lower degree nodes. Fi- 
nally, it is also surprising that the BFS strategy, widely used 
to crawl and explore online social networks (e.g. [41]) and
other graphs (e.g. [H]), performs quite dismally on all three 
measures. In short, we find that nodes contributing most to 
the expansion of the sample are unique in that they provide 
specific and significant advantages over and above those pro- 
vided by nodes that are simply well-connected and those ac- 
cumulated through standard BFS-based crawls. These and 
previously mentioned results are in contrast to the
conventional wisdom followed in much of the existing literature
(e.g. [2, 3, 30, 41, 42]).

5.4 A Note on Seed Sensitivity 

As described, link-trace sampling methods are initiated 
from randomly selected seeds. This raises the question: How
sensitive are these results to the seed supplied to a
strategy? Figure 5 shows the standard deviation of each
sampling strategy for both hub inclusion and network reach as
sample size grows. We generally find that methods with the 
most explicit biases (XS, SEC, DS) tend to exhibit the least 
seed sensitivity and variability, while the remaining meth- 
ods (BFS, DFS, FFS, RW) exhibit the most. This trend is 
exhibited across all measures and all datasets. 
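
As a rough illustration of how such seed sensitivity could be measured, the following minimal sketch runs one strategy (BFS) from several random seeds and reports the standard deviation of the resulting network reach. The function names, the toy graph, and the form of the discovery quotient used here (the fraction of nodes that are either sampled or adjacent to a sampled node) are assumptions made for illustration, not the experimental code behind Figure 5.

```python
import random
import statistics
from collections import deque


def bfs_sample(adj, seed, k):
    """Return the first k nodes visited by a breadth-first crawl from seed."""
    seen, order, queue = {seed}, [], deque([seed])
    while queue and len(order) < k:
        u = queue.popleft()
        order.append(u)
        for w in sorted(adj[u]):  # sorted only to make the crawl reproducible
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return order


def discovery_quotient(adj, sample):
    """Assumed DQ: fraction of nodes sampled or adjacent to a sampled node."""
    discovered = set(sample)
    for u in sample:
        discovered |= adj[u]
    return len(discovered) / len(adj)


def seed_sensitivity(adj, k, n_seeds, rng):
    """Standard deviation of the discovery quotient over random seeds."""
    seeds = rng.sample(sorted(adj), n_seeds)
    scores = [discovery_quotient(adj, bfs_sample(adj, s, k)) for s in seeds]
    return statistics.pstdev(scores)


# Hypothetical toy graph: a hub (node 0) attached to a path 5-6-7-8.
adj = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0},
       5: {0, 6}, 6: {5, 7}, 7: {6, 8}, 8: {7}}
spread = seed_sensitivity(adj, k=2, n_seeds=9, rng=random.Random(0))
```

A larger `spread` means the outcome depends more heavily on where the crawl happens to start; swapping `bfs_sample` for another strategy allows a like-for-like comparison.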

6. ANALYZING SAMPLING BIASES 

Let us briefly summarize two main observations from
Section 5. We saw that the XS strategy dramatically
outperformed all others in accumulating nodes from many
different communities. We also saw that the SEC strategy was
often a reasonably good approximation to directly sampling
high degree nodes and locates the set of most well-connected
nodes significantly faster than most other methods. Here,
we turn our attention to analytically examining these
observed connections. We begin by briefly summarizing some
existing analytical results.

Figure 4: Evaluating network reach. Results for remaining
networks are similar, with XS exhibiting superior
performance on all three criteria.

Figure 5: Standard deviation for DQ and Hubs on Epinions
network. Results are similar for remaining networks.

6.1 Existing Analytical Results 

Random Walks (RW). There is a fairly large body of research
on random walks and Markov chains (see [37] for an
excellent survey). A well-known analytical result states that
the probability (or stationary probability) of residing at any
node v during a random walk on a connected, undirected
graph converges with time to $\frac{d_v}{2|E|}$, where $d_v$ is
the degree of node v and $|E|$ is the number of edges [2JJ. In
fact, the hitting time of a random walk (i.e. the expected
number of steps required to reach a node beginning from any
node) has been analytically shown to be directly related to
this stationary probability [24]. Random walks, then, are
naturally biased towards high degree (and high PageRank)
nodes, which provides some theoretical explanation as to why
RW performs slightly better than other strategies (e.g. BFS)
on measures such as hub inclusion. However, as shown in
Figure 1, it is nowhere near the best performers. Thus, these
analytical results appear only to hold in the limit and fail
to predict actual sampling performance.
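
The stationary result above can be checked on a small example. The sketch below, with a hypothetical toy graph and function names chosen for illustration, computes $d_v / (2|E|)$ for every node and confirms that one further step of the walk leaves that distribution unchanged. Exact fractions are used so the comparison is not affected by floating-point error.

```python
from fractions import Fraction


def stationary_distribution(adj):
    """pi(v) = d_v / (2|E|) for a connected, undirected graph."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2|E|
    return {v: Fraction(len(nbrs), two_m) for v, nbrs in adj.items()}


def walk_step(adj, dist):
    """Advance a probability distribution by one uniform random-walk step."""
    nxt = {v: Fraction(0) for v in adj}
    for u, p in dist.items():
        for w in adj[u]:
            nxt[w] += p / len(adj[u])
    return nxt


# Hypothetical toy graph: a triangle 0-1-2 with a pendant node 3 on node 0.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
pi = stationary_distribution(adj)
is_fixed_point = walk_step(adj, pi) == pi
```

Note that this only describes the limiting behavior; it says nothing about how many steps a walk needs before its visits resemble `pi`, which is the gap between theory and sampling performance noted above.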

Degree Sampling (DS). In studying the problem of search- 
ing peer-to-peer networks, Adamic et al. [5] proposed and 
analyzed a greedy search strategy very similar to the DS 
sampling method. This strategy, which we refer to as a 
degree-based walk, was analytically shown to quickly find 
the highest-degree nodes and quickly cover large portions of 
scale-free networks. Thus, these results provide a theoretical 
explanation for performance of the DS strategy on measures 
such as hub inclusion and the discovery quotient. 
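
A minimal sketch of a degree-based link-trace strategy of this kind is given below. It assumes that, at each step, the unsampled neighbor of the current sample with the highest degree is added; the function name, the tie-breaking rule, and the toy graph are illustrative assumptions rather than the exact procedure analyzed in the cited work.

```python
def degree_sample(adj, seed, k):
    """Greedy link-trace sampling: repeatedly add the frontier node of highest degree."""
    sample, in_sample = [seed], {seed}
    while len(sample) < k:
        # Frontier N(S): nodes adjacent to the sample but not yet in it.
        frontier = {w for u in in_sample for w in adj[u]} - in_sample
        if not frontier:
            break  # the seed's connected component is exhausted
        # Ties are broken by node label (an arbitrary choice for reproducibility).
        best = max(sorted(frontier), key=lambda v: len(adj[v]))
        sample.append(best)
        in_sample.add(best)
    return sample


# Hypothetical toy graph: node 2 is a hub reachable from seed 0 through node 1.
adj = {0: {1}, 1: {0, 2, 6}, 2: {1, 3, 4, 5},
       3: {2}, 4: {2}, 5: {2}, 6: {1}}
picked = degree_sample(adj, seed=0, k=3)
```

Note that this strategy needs the full degree of every frontier node, which is more information than a strategy that only looks at links into the current sample.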

Other Results. As mentioned in Section 2, to the best of
our knowledge, most of the other analytical results on
sampling bias focus on negative results [1, 15, 30, 31, 46].
Thus, these works, although intriguing, may not provide much
help in explaining the positive results shown in Section 5.

We now analyze two methods for which there are few or
no existing analytical results: XS and SEC.

6.2 Analyzing XS Bias 

A widely used measure for the "goodness" or the strength
of a community in graph clustering and community detection
is conductance [26], which is a function of the fraction
of total edges emanating from a sample (lower values mean
stronger communities):

$$\varphi(S) = \frac{\sum_{i \in S,\, j \in \bar{S}} a_{ij}}{\min\{a(S),\, a(\bar{S})\}}$$

where $a_{ij}$ are entries of the adjacency matrix representing
the graph, $\bar{S} = V - S$, and $a(S) = \sum_{i \in S} \sum_{j \in V} a_{ij}$
is the total number of edges incident to the node set S.
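
To make the measure concrete, the following sketch computes conductance for a node set in an unweighted, undirected graph stored as adjacency sets. It assumes the standard form of the definition (cut edges divided by the smaller of the two incident-edge totals); the function name and the toy graph are hypothetical.

```python
def conductance(adj, S):
    """Cut edges leaving S divided by min(a(S), a(V - S))."""
    S = set(S)
    cut = sum(1 for i in S for j in adj[i] if j not in S)
    a_S = sum(len(adj[i]) for i in S)                   # edge endpoints in S
    a_rest = sum(len(adj[i]) for i in adj if i not in S)
    denom = min(a_S, a_rest)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom


# Hypothetical toy graph: two triangles joined by the single edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
phi_triangle = conductance(adj, {0, 1, 2})
phi_single = conductance(adj, {0})
```

The triangle has a low value because only one of its seven edge endpoints leads outside, whereas a single node has the maximum value because all of its edges leave the set.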

It can be shown that, provided the conductance of
communities is sufficiently low, sample expansion is directly
affected by community structure. Consider a simple random
graph model with vertex set V and a community structure
represented by partition $C = \{C_1, \ldots, C_{|C|}\}$ where
$C_1 \cup \ldots \cup C_{|C|} = V$. Let $e_{in}$ and $e_{out}$ be the
number of each node's edges pointing within and outside the
node's community, respectively. These edges are connected
uniformly at random to nodes either within or outside a node's
community, similar to a configuration model (e.g., [11]). Note
that both $e_{in}$ and $e_{out}$ are related directly to conductance.
When conductance is lower, $e_{out}$ is smaller³ as compared to
$e_{in}$. The following theorem expresses the link between
expansion and community reach in terms of these inward and
outward edges.

Theorem 1. Let S be the current sample, v be a new
node to be added to S, and n be the size of v's community.
If $e_{out} < \frac{|V|\, e_{in}^2}{n\,(|V| + e_{in}|S|)}$, then the expected
expansion of $S \cup \{v\}$ is higher when v is in a new community
than when v is in a current community.



³ Suppose the conductance of a vertex set Y is $\varphi(Y)$, the
total number of edges incident to Y is e, and $e_{in}$ and $e_{out}$
are random variables denoting the inward and outward edges,
respectively, of each node (as opposed to constant values).
Then, $\mathbb{E}(e_{out}) = \frac{e\,\varphi(Y)}{|Y|}$ and
$\mathbb{E}(e_{in}) = \frac{e\,(1 - \varphi(Y))}{|Y|}$. If $\varphi(Y) < \frac{1}{2}$,
then $\mathbb{E}(e_{out}) < \mathbb{E}(e_{in})$. (In this example, the
expectations are over nodes in Y only.)



Proof. Let $X_{new}$ be the expected value for
$|N(\{v\}) - (N(S) \cup S)|$ when v is in a new community and let
$X_{curr}$ be the expected value when not. We compute an upper
bound on $X_{curr}$ and a lower bound on $X_{new}$.

Deriving $X_{curr}$: Assume v is affiliated with a current
community already represented by at least one node in S. Since
we are computing an upper bound on $X_{curr}$, we assume
there is exactly one node from S within v's community, as
this is the minimum for v's community to be a current
community. By the linearity of expectations, the upper bound
on $X_{curr}$ is $e_{out} + \frac{e_{in}(n - e_{in})}{n}$, where the term
$\frac{e_{in}(n - e_{in})}{n}$ is the expected number of nodes in v's
community that are both linked to v and in the set
$V - (N(S) \cup S)$.

Deriving $X_{new}$: Assume v belongs to a new community not
already represented in S. (By definition, no nodes in S will
be in v's community.) Applying the linearity of expectations
once again, the lower bound on $X_{new}$ is
$e_{in} - e_{out}|S|\frac{e_{in}}{|V|}$, where the term
$e_{out}|S|\frac{e_{in}}{|V|}$ is the expected number of nodes
in v's community that are both linked to v and already in
N(S).

Solving for $e_{out}$, if $e_{out} < \frac{|V|\, e_{in}^2}{n\,(|V| + e_{in}|S|)}$,
then $X_{new} > X_{curr}$. □

Theorem 1 shows analytically the link between expansion
and community structure - a connection that, until now, has
only been empirically demonstrated [40]. Thus, a theoretical
basis for performance of the XS strategy on community reach
is revealed.
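
The expansion quantity $|N(\{v\}) - (N(S) \cup S)|$ used in the analysis above can be turned into a simple greedy sampler. The sketch below is a deterministic simplification written for illustration: it assumes the candidate with the largest expansion is always chosen from the frontier N(S), with ties broken by node label. The function names and the toy graph are assumptions, not the authors' implementation.

```python
def neighborhood(adj, S):
    """N(S): nodes outside S that are adjacent to at least one node in S."""
    S = set(S)
    return {w for u in S for w in adj[u]} - S


def expansion_sample(adj, seed, k):
    """Greedy sketch of expansion-biased sampling."""
    sample, in_sample = [seed], {seed}
    while len(sample) < k:
        frontier = neighborhood(adj, in_sample)
        if not frontier:
            break  # nothing left to reach from the current sample
        known = frontier | in_sample  # N(S) united with S
        # Expansion of v: neighbors of v that are not yet known.
        best = max(sorted(frontier), key=lambda v: len(adj[v] - known))
        sample.append(best)
        in_sample.add(best)
    return sample


# Hypothetical toy graph: two triangles (communities) joined by edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
       3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
picked = expansion_sample(adj, seed=0, k=3)
```

On this toy graph the sampler skips node 1, which adds nothing new, and moves through the bridge into the second triangle, which mirrors the "push" towards unseen communities described in the text.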

6.3 Analyzing SEC Bias 

Recall that the SEC method uses the degree of a node v
in the induced subgraph $G_{S \cup \{v\}}$ as an estimate of the
degree of v in G. In Section 5, we saw that this choice performs
quite well in practice. Here, we provide theoretical
justification for the SEC heuristic. Consider a random network G
with some arbitrary expected degree sequence (e.g. a power
law random graph under the so-called G(w) model [11]) and
a sample $S \subset V$. Let $d(\cdot, \cdot)$ be a function that returns
the expected degree of a given node in a given random network
(see [11] for more information on expected degree sequences).
Then, it is fairly straightforward to show the following holds.

Proposition 1. For any two nodes $v, w \in N(S)$,
if $d(v, G) > d(w, G)$, then $d(v, G_{S \cup \{v\}}) > d(w, G_{S \cup \{w\}})$.

Proof. The probability of an edge between any two nodes
i and j in G is $\frac{d(i,G)\, d(j,G)}{\Delta}$, where
$\Delta = \sum_{m \in V} d(m, G)$. Let
$\delta = d(v, G_{S \cup \{v\}}) - d(w, G_{S \cup \{w\}})$. Then,

$$\delta = \sum_{x \in S} \frac{d(x,G)\, d(v,G)}{\Delta} - \sum_{x \in S} \frac{d(x,G)\, d(w,G)}{\Delta} = \big(d(v,G) - d(w,G)\big) \sum_{x \in S} \frac{d(x,G)}{\Delta} \qquad (2)$$

Since $\delta > 0$ only when $d(v,G) > d(w,G)$, the proposition
holds. □
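
The argument can be illustrated numerically. The sketch below assumes the edge probability between nodes i and j is the product of their expected degrees divided by the sum of all expected degrees, as in the expected degree model referred to above. The degree values are hypothetical and chosen so that every such product stays at or below the total, keeping each probability valid.

```python
from fractions import Fraction


def expected_induced_degree(v, S, d):
    """Expected number of links from v into S under the assumed edge model."""
    total = sum(d.values())  # sum of all expected degrees
    return sum(Fraction(d[x] * d[v], total) for x in S)


# Hypothetical expected degrees; the total is 16 and the largest product is 9.
d = {"a": 3, "b": 2, "c": 2, "v": 3, "w": 1, "p": 2, "q": 2, "r": 1}
S = ["a", "b", "c"]

delta = expected_induced_degree("v", S, d) - expected_induced_degree("w", S, d)
# Factored form: (d_v - d_w) times the sum over S of d_x / total.
factored = (d["v"] - d["w"]) * Fraction(sum(d[x] for x in S), sum(d.values()))
```

Because the sum over S is a positive constant shared by both nodes, the sign of `delta` is determined entirely by which of the two nodes has the larger expected degree in G.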

Combining Proposition 1 with analytical results from [2]
(described in Section 6.1) provides a theoretical basis for the
observed performance of the SEC strategy on measures such
as hub inclusion. Finally, recall from Section 5.1.2 that the
extent to which SEC matched the performance of DS on
Hubs seemed to partly depend on the tail of degree
distributions. Proposition 1 also yields insights into this
phenomenon. Longer and denser tails allow for more "slack"
when deviating from these expectations of random variables
(as in real-world link patterns that are not purely random).

7. APPLICATIONS FOR OUR FINDINGS 

We now briefly describe ways in which some of our findings
may be exploited in important, real-world applications.
Although numerous potential applications exist, we focus
here on three areas: 1) Outbreak Detection, 2) Marketing,
and 3) Landmarks and Graph Exploration.

7.1 Practical Outbreak Detection 

What is the most effective and efficient way to predict and 
prevent a disease outbreak in a social network? In a recent 
paper, Christakis and Fowler studied outbreak detection of 
the H1N1 flu among college students at Harvard University [10].
Previous research has shown that well-connected (i.e. high
degree) people in a network catch infectious diseases
earlier than those with fewer connections [13, 17, 51].
Thus, monitoring these individuals allows forecasting the 
progression of the disease (a boon to public health officials) 
and immunizing these well-connected individuals (when im- 
munization is possible) can prevent or slow further spread. 
Unfortunately, identifying well-connected individuals in a 
population is non-trivial, as access to their friendships and 
connections is typically not fully available. Moreover,
collecting this information is time-consuming, prohibitively
expensive, and often impossible for large networks. Matters are
made worse by the fact that most existing network-based
techniques for immunization selection and outbreak detection
assume full knowledge of the global network structure
(e.g. [36, 48]). This, then, presents a prime opportunity to
exploit the power of sampling.

To identify well-connected students and predict the outbreak,
Christakis and Fowler [10] employed a sampling technique
called acquaintance sampling (ACQ) based on the
so-called friendship paradox [10, 13, 51]. The idea is that
random neighbors of randomly selected nodes in a network
will tend to be highly connected [13, 17, 51]. Christakis and
Fowler [10], therefore, sampled random friends of randomly
selected students with the objective of constructing a sample
of highly connected individuals. Based on our aforementioned
results, we ask: Can we do better than this ACQ
strategy? In previous sections, we showed empirically and
analytically that the SEC method performs exceedingly well
in accumulating hubs. (It also happens to require less
information than DS and XS, the other top performers.) Figure 6
shows the sample size required to locate the top-ranked
well-connected individuals for both SEC and ACQ. The
performance differential is quite remarkable, with the SEC method
faring overwhelmingly better in quickly zeroing in on the
set of most well-connected nodes. Aside from its superior
performance, SEC has one additional advantage over the
ACQ method employed by Christakis and Fowler. The ACQ
method assumes that nodes in V can be selected uniformly
at random. It is, in fact, dependent on this [13]. (ACQ,
then, is not a link-trace sampling method.) By contrast,
SEC, as a pure link-trace sampling strategy, has no such
requirement and, thus, can be applied in realistic scenarios for
which ACQ is unworkable.

Figure 6: Comparison of SEC and ACQ on quickly locating
well-connected individuals (lower is better). SEC far
surpasses ACQ. Results are similar for every network.
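
The practical difference between the two strategies can be sketched as follows. This is an illustrative simplification, not the code used to produce Figure 6: `sec_sample` greedily adds the frontier node with the most links into the current sample (ties broken by node label), and `acq_sample` draws a uniformly random node and then one of its neighbors uniformly at random. All names and the toy graph are assumptions.

```python
import random


def sec_sample(adj, seed, k):
    """Link-trace sketch of SEC: add the frontier node with most links into S."""
    sample, in_sample = [seed], {seed}
    while len(sample) < k:
        frontier = {w for u in in_sample for w in adj[u]} - in_sample
        if not frontier:
            break
        # Degree of v in the subgraph induced by S plus v.
        best = max(sorted(frontier), key=lambda v: len(adj[v] & in_sample))
        sample.append(best)
        in_sample.add(best)
    return sample


def acq_sample(adj, k, rng):
    """Sketch of acquaintance sampling: random neighbors of uniformly random nodes."""
    nodes = sorted(adj)
    reachable = [v for v in nodes if adj[v]]  # nodes that can appear as a neighbor
    if k > len(reachable):
        raise ValueError("k exceeds the number of nodes that have neighbors")
    sample = set()
    while len(sample) < k:
        u = rng.choice(nodes)  # requires uniform access to the whole node set V
        if adj[u]:
            sample.add(rng.choice(sorted(adj[u])))
    return sample


# Hypothetical toy graph: hub 0 with leaves 1..5, plus an extra edge 1-2.
adj = {0: {1, 2, 3, 4, 5}, 1: {0, 2}, 2: {0, 1}, 3: {0}, 4: {0}, 5: {0}}
sec_picked = sec_sample(adj, seed=3, k=4)
acq_picked = acq_sample(adj, k=2, rng=random.Random(0))
```

The call to `rng.choice(nodes)` is the step that makes ACQ depend on uniform access to V, whereas `sec_sample` only ever touches nodes reachable by following links from the seed.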

7.2 Marketing 

Recall from Section 5.3 that a community in a network
is a cluster of nodes more densely connected among
themselves than to others. Identifying communities is important,
as they often correspond to real social groups, functional
groups, or similarity (both demographic and not) [18]. The
ability to easily construct a sample consisting of members 
from diverse groups has several important applications in 
marketing. Marketing surveys often seek to construct strat- 
ified samples that collectively represent the diversity of the 
population [28] . If the attributes of nodes are not known in 
advance, this can be challenging. The XS strategy, which 
exhibited the best community reach, can potentially be very 
useful here. Moreover, it has the added power of being able 
to locate members from diverse groups with absolutely no 
a priori knowledge of demographic attributes, social variables,
or the overall community structure present in the network.
There is also recent evidence to suggest that being
able to construct a sample from many different communities 
can be an asset in effective word-of-mouth marketing [8]. 
This, then, represents yet another potential marketing ap- 
plication for the XS strategy. 

7.3 Landmarks and Graph Exploration 

Landmark-based methods represent a general class of algo- 
rithms to compute distance-based metrics in large networks 
quickly [44]. The basic idea is to select a small sample of 
nodes (i.e. the landmarks), compute offline the distances 
from these landmarks to every other node in the network, 
and use these pre-computed distances at runtime to
approximate distances between pairs of nodes. As noted in [44], for
this approach to be effective, landmarks should be selected
so that they cover significant portions of the network. Based
on our findings for network reach in Section 5.3, the XS
strategy overwhelmingly yields the best discovery quotient
and covers the network significantly better than any other
strategy. Thus, it represents a promising landmark selection
strategy. Our results for the discovery quotient and other
measures of network reach also yield important insights into
how graphs should best be explored, crawled, and searched.
As shown in Figure 4, the most prevalently used method
for exploring networks, BFS, ranks low on measures of
network reach. This suggests that BFS and its pervasive
use in social network data acquisition and exploration (e.g. 
see [UJ) should possibly be examined more closely. 
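
The landmark idea described above can be sketched in a few lines. The version below is one common variant, assumed here for illustration: distances from each landmark are precomputed with BFS, and a query returns the triangle-inequality upper bound, i.e. the smallest sum of the two landmark distances. Function names and the toy graph are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (offline step)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def build_tables(adj, landmarks):
    """One precomputed distance table per landmark."""
    return [bfs_distances(adj, l) for l in landmarks]


def estimate_distance(tables, u, v):
    """Upper bound on d(u, v): min over landmarks l of d(l, u) + d(l, v)."""
    return min((t[u] + t[v] for t in tables if u in t and v in t),
               default=float("inf"))


# Hypothetical toy graph: the path 0-1-2-3-4 with node 2 as the only landmark.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
tables = build_tables(adj, landmarks=[2])
far_pair = estimate_distance(tables, 0, 4)   # landmark lies on the shortest path
near_pair = estimate_distance(tables, 0, 1)  # landmark is off the shortest path
```

The estimate is exact when some landmark lies on a shortest path between the two nodes and an overestimate otherwise, which is why landmarks that cover large portions of the network are desirable.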

8. CONCLUSION 

We have conducted a detailed study on sampling biases
in real-world networks. In our investigation, we found
BFS, a widely used method for sampling and crawling
networks, to be among the worst performers in both discovering
the network and accumulating critical, well-connected hubs.
We also found that sampling biases towards high expansion
tend to accumulate nodes that are uniquely different
from those that are simply well-connected or traversed
during a BFS-based strategy. These high-expansion nodes tend
to be in newer and different portions of the network not
already encountered by the sampling process. We further
demonstrated that sampling nodes with many connections
from those already sampled is a reasonably good
approximation to sampling high degree nodes. Finally, we
demonstrated several ways in which these findings can be exploited
in real-world applications such as disease outbreak detection
and marketing. For future work, we intend to investigate
ways in which the top-performing sampling strategies can
be enhanced for even wider applicability. One such
direction is to investigate the effects of alternating or combining
different biases into a single sampling strategy.